Simulator optimization branch: GPU density matrix, perf, audits, shar…#290

Merged
ciaranra merged 1 commit into dev from sim-opt on Apr 13, 2026
Conversation

@ciaranra
Member

…ed context

Major work on branch (squash of 28 commits + final clippy cleanup):

GPU simulator optimizations

  • f32 gate fusion, commuting-gate reorder, CY/SWAP/RXX/RYY direct shaders
  • Persistent kernel for small states with dynamic shared-memory sizing
  • CPU measurement fast path for small states (f32 <=16q, f64 <=15q)
  • Parallel CX/CZ/RZZ/RXX/RYY via rayon when .parallel(true)
  • Parallel scalar CX path for low-qubit pairs (q_lo < 2)
  • Fused flush_gates + state() readback into single encoder
  • Adaptive mz path selection: empirical N/M lookup table replacing hardcoded threshold
  • Raised StateVecSoA parallel threshold 14 -> 21 qubits
  • Exploration benchmarks for adaptive path decisions
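
To illustrate why the two-qubit kernels listed above parallelize cleanly, here is a minimal sketch of a scalar CX on a state vector. It is not the code from this branch: `apply_cx` is a hypothetical name, amplitudes are simplified to real `f64` values, and bit `q` of the index is assumed to encode qubit `q`.

```rust
/// Hypothetical scalar CX kernel (illustration only, not the branch's implementation).
/// For every basis index with the control bit set and the target bit clear,
/// swap the amplitude with its partner that has the target bit set.
pub fn apply_cx(state: &mut [f64], control: usize, target: usize) {
    assert_ne!(control, target, "control and target must differ");
    assert!(state.len().is_power_of_two(), "state length must be 2^n");
    let c_mask = 1usize << control;
    let t_mask = 1usize << target;
    assert!(c_mask < state.len() && t_mask < state.len(), "qubit out of range");

    for i in 0..state.len() {
        // Visit each pair once, from the member whose target bit is 0.
        if (i & c_mask) != 0 && (i & t_mask) == 0 {
            state.swap(i, i | t_mask);
        }
    }
}
```

Each swap touches a disjoint pair of amplitudes, so the loop body has no cross-iteration dependencies; that independence is what makes it possible to split such a loop across threads. For example, calling `apply_cx(&mut state, 0, 1)` on the two-qubit state `[0.0, 1.0, 0.0, 0.0]` moves the amplitude from index 1 to index 3.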

GpuDensityMatrix

  • Choi-Jamiolkowski representation on top of GpuStateVec
  • Generic over backend (f32 / f64); gates, noise channels, helpers
  • Cholesky re-purification for mixed states
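
For context, the standard identities behind a vectorized density-matrix layout and a Cholesky-based purification are sketched below. The column-stacking convention is an assumption; the index ordering GpuDensityMatrix actually uses is not stated in this description.

```latex
% Column-stacking vectorization: an n-qubit density matrix becomes a 2n-qubit vector,
% and a unitary gate U acts on it as a tensor product.
\operatorname{vec}\!\left(U \rho\, U^{\dagger}\right)
  = \left(U^{*} \otimes U\right) \operatorname{vec}(\rho)

% Cholesky factorization of a positive semidefinite \rho: the columns \ell_k of L
% form an unnormalized pure-state ensemble that reproduces \rho.
\rho = L L^{\dagger} = \sum_{k} \lvert \ell_k \rangle \langle \ell_k \rvert
```

The first identity is what allows a state-vector backend to be reused for density-matrix evolution: a gate on the density matrix becomes a pair of gates on a doubled register.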

Correctness fixes (from audits)

  • GPU 2q rotation shaders (RXX/RYY bit_a==bit_b bug)
  • DensityMatrix phase/amplitude damping trace preservation
  • mz is_deterministic flag (previously hardcoded false)
  • GpuPauliProp gate ordering + Pauli X/Y/Z semantics, stale buffer reads
  • GpuDensityMatrix mz probability formula
  • GpuStabMulti::mz_queue now snapshots state at call time
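
As an illustration of what an `is_deterministic` flag on a Z measurement can convey, here is a minimal sketch, not the fix made on this branch. `MeasOutcome`, `measure_z`, the tolerance, and the use of a precomputed probability vector are all hypothetical choices made for the example.

```rust
/// Hypothetical result type for a Z-basis measurement (illustration only).
#[derive(Debug, PartialEq)]
pub struct MeasOutcome {
    pub outcome: bool,
    pub is_deterministic: bool,
}

/// Measure `qubit` given per-basis-state probabilities and a uniform sample in [0, 1).
/// The outcome is flagged deterministic when P(1) is numerically 0 or 1.
pub fn measure_z(probs: &[f64], qubit: usize, sample: f64) -> MeasOutcome {
    let mask = 1usize << qubit;
    // Total probability of all basis states in which the measured bit is 1.
    let p1: f64 = probs
        .iter()
        .enumerate()
        .filter(|(i, _)| (*i & mask) != 0)
        .map(|(_, p)| *p)
        .sum();
    let eps = 1e-12; // assumed tolerance, chosen only for this sketch
    MeasOutcome {
        outcome: sample < p1,
        is_deterministic: p1 < eps || p1 > 1.0 - eps,
    }
}
```

A flag hardcoded to `false` would report every measurement as random, including those on basis states, which is the kind of discrepancy an audit can catch by measuring a known computational-basis state.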

Shared GPU context (concurrent-use SIGSEGV fix)

  • Process-wide OnceLock<GpuDeviceContext> in pecos-gpu-sims/src/gpu_probe.rs
  • All 7 simulators now reuse one wgpu Instance/Adapter/Device/Queue
  • Fixes crashes under cargo's parallel test harness and MonteCarloEngine shots
  • Removed the --test-threads=1 workaround from pecos-cli rust test
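
The shared-context pattern described above can be sketched with `std::sync::OnceLock`. This is a minimal illustration rather than the code in gpu_probe.rs: the struct's field, the `shared_context` function, and the init counter are hypothetical stand-ins, and the real context would hold the wgpu Instance/Adapter/Device/Queue handles instead of a string.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for the shared context; the real type would own wgpu handles.
#[derive(Debug)]
pub struct GpuDeviceContext {
    pub adapter_name: String,
}

/// Counts how many times initialization actually ran (for demonstration only).
pub static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

static SHARED: OnceLock<GpuDeviceContext> = OnceLock::new();

/// Returns the process-wide context, creating it on the first call.
/// Concurrent callers block until the single initialization completes.
pub fn shared_context() -> &'static GpuDeviceContext {
    SHARED.get_or_init(|| {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        GpuDeviceContext {
            adapter_name: String::from("placeholder-adapter"),
        }
    })
}
```

Because `get_or_init` runs its closure at most once per process, every simulator and every test thread receives a reference to the same context instead of creating its own device.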

Test infrastructure

  • New audits: gate_audit, gate_fuzz, pauli_prop_audit, influence_sampler_audit, large_n_audit, noisy_sampler_stats, stab_extra_audits, extra_audits, flush_blocked_audit, concurrent_gpu_test
  • Removed pecos-quest / pecos-qulacs wrapper crates (bench code only)
  • Removed dangling quest_sim_test.rs and quest_example.rs

Clippy / lint cleanup

  • GpuError::Startup variant wraps GpuStartupError
  • Internal GPU constants now usize (casts removed)
  • Renamed GatePipeline variants SWAP/RXX/RYY/RZZ -> Swap/Rxx/Ryy/Rzz
  • # Panics / # Errors doc sections added where clippy required
  • Cholesky loops allow needless_range_loop with justification
  • Test files allow cast_possible_truncation / cast_precision_loss with rationale

@ciaranra ciaranra force-pushed the sim-opt branch 9 times, most recently from e247f4f to 4787c14 Compare April 13, 2026 14:32
@ciaranra ciaranra marked this pull request as ready for review April 13, 2026 16:26
@ciaranra ciaranra merged commit 7651dce into dev Apr 13, 2026
88 checks passed
@ciaranra ciaranra deleted the sim-opt branch April 13, 2026 16:26